Mapping of Sequence Reads to the Reference Genomes ◾ 83
SRR769545_mem_rmdup.bam 2> SRR769545_mem_rmdup.log
The above command removes duplicate reads from the paired-end BAM file. To treat
paired-end reads as single-end reads, use the “-S” option.
2.4.1.7 Descriptive Statistics
Some Samtools utilities including “flagstat”, “coverage”, and “depth” can provide simple
statistics on a BAM file.
samtools flagstat SRR769545_mem_sorted.bam
samtools coverage SRR769545_mem_sorted.bam > coverage.txt
samtools depth SRR769545_mem_sorted.bam > depth.txt
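The per-position table written by “samtools depth” can be condensed into a few summary figures. The following is a minimal sketch, assuming the default three-column, tab-separated output (reference name, position, depth). The file toy_depth.txt and its values are hypothetical and exist only so the example runs on its own; in practice you would point awk at depth.txt.

```shell
# Toy stand-in for depth.txt (hypothetical values): chrom <TAB> pos <TAB> depth
printf 'chr1\t1\t10\nchr1\t2\t20\nchr1\t3\t30\nchr1\t4\t0\n' > toy_depth.txt

# Mean depth over the listed positions, and the fraction of those
# positions covered by at least 10 reads.
summary=$(awk -F'\t' '
    { sum += $3; if ($3 >= 10) n10++ }
    END { if (NR > 0) printf "mean_depth=%.2f frac_ge10=%.2f", sum / NR, n10 / NR }
' toy_depth.txt)

echo "$summary"
```

Note that the mean is taken over the positions present in the file. By default “samtools depth” omits positions with zero depth, so add the “-a” option if uncovered positions should count toward the average.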
For other Samtools commands, see the Samtools documentation, which is available
at “http://www.htslib.org/doc/samtools.html”.
2.5 REFERENCE-GUIDED GENOME ASSEMBLY
Reference-guided genome assembly uses the reference genome of an organism as a guide
to assemble a new genome. This kind of assembly is used when a genome is re-sequenced
to obtain a better-quality assembly or for variant discovery and haplotype construction.
It is widely used to sequence the genomes of individuals of the same species as the
reference genome in order to detect genotypes that may be associated with certain
phenotypes or diseases such as cancers, and to detect viral and bacterial variants or
strains. It can also be used to assemble the genomes of closely related species that do
not have reference genomes available. Whole-genome sequence reads from an individual are
used in the assembly. The workflow is the same as shown in Figure 2.13 up to the point
of creating the SAM/BAM file. The additional step is that the aligned reads are piled up
to create consensus sequences from the overlapping, contiguous aligned reads. These
consensus sequences are called contigs. From these contigs, only the bases that differ
from the reference (variants) are used to edit the sequence of the reference genome to
create a new genome sequence.
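The pile-up and consensus idea can be illustrated with a small toy example. This is only a conceptual sketch using a simple majority vote per position; it is not how Samtools or any real variant caller computes a consensus, since those tools also weigh base and mapping qualities. The file toy_pileup.txt, its three-column layout (position, reference base, bases of the reads covering that position), and the output file toy_variants.txt are all hypothetical.

```shell
# Toy pileup (hypothetical): position <TAB> reference base <TAB> read bases
printf '1\tA\tAAAA\n2\tC\tCTTT\n3\tG\tGGG\n4\tT\tTTCT\n' > toy_pileup.txt

new_seq=$(awk -F'\t' '
{
    split("", count)                 # reset the per-position base counts
    best = $2; best_n = 0            # fall back to the reference base
    n = length($3)
    for (i = 1; i <= n; i++) {
        b = substr($3, i, 1)
        count[b]++
        # strict ">" means the first base to reach the top count wins ties
        if (count[b] > best_n) { best = b; best_n = count[b] }
    }
    # record only the positions where the consensus differs from the reference
    if (best != $2) printf "%d\t%s>%s\n", $1, $2, best > "toy_variants.txt"
    seq = seq best                   # edited sequence: reference plus variants
}
END { print seq }
' toy_pileup.txt)

echo "$new_seq"
cat toy_variants.txt
```

Here the reference reads ACGT, the reads at position 2 mostly carry T, and the edited sequence therefore differs from the reference at that single position.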
For the following practice, you can use any of the SAM files produced by BWA or
Bowtie2 above, or you can run the following commands to download the FASTQ files from
the NCBI SRA database and compress them, download the human reference genome
from the UCSC database and index it, and then perform read mapping with Bowtie2 to
produce a SAM file:
mkdir ref_guided_ass
cd ref_guided_ass
mkdir data
cd data
fasterq-dump --verbose SRR769545
gzip SRR769545_1.fastq
gzip SRR769545_2.fastq
cd ..
mkdir ref